
The Dataset Collection

The Dataset Collection consists of large data archives from both sites and individuals.

1,407 results


MusicBrainz Data Dumps
collection
510 items
eye 56,395
The MusicBrainz Database is built on the PostgreSQL relational database engine and contains all of MusicBrainz' music metadata. This data includes information about artists, release groups, releases, recordings, works, and labels, as well as the many relationships between them. The database also contains a full history of all the changes that the MusicBrainz community has made to the data. Core data: Artists: name, sort name, IPI, aliases, type, begin and end dates, disambiguation comment, MBID...
The Dataset Collection
data
eye 39,433
favorite 7
comment 3
(The original Reddit comment announcing this collection describes the data and the processes used.) This is an archive of Reddit comments from October 2007 through May 2015 (complete month). It reflects 14 months of work and a lot of API calls. The dataset includes nearly every publicly available Reddit comment; approximately 350,000 comments out of ~1.65 billion were unavailable due to Reddit API issues. Q: How are the files structured? Each file is compressed with bzip2....
( 3 reviews )
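As a rough illustration of working with bzip2-compressed files like those described above, the sketch below streams one file line by line without decompressing it to disk first. It assumes each line holds one JSON object, which the truncated description does not confirm; the file name and the `id` and `body` fields are hypothetical stand-ins.

```python
import bz2
import json
import os
import tempfile


def iter_comments(path):
    """Yield one decoded JSON object per non-empty line of a bzip2 file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny stand-in file so the sketch runs without the real archive.
# The field names here are hypothetical, not taken from the dataset.
sample_rows = [
    {"id": "c1", "body": "first"},
    {"id": "c2", "body": "second"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_comments.bz2")
with bz2.open(sample_path, mode="wt", encoding="utf-8") as handle:
    for row in sample_rows:
        handle.write(json.dumps(row) + "\n")

loaded = list(iter_comments(sample_path))
print(len(loaded), "comments read")
```

Reading through a generator keeps memory use flat, which matters when a single month of comments may be many gigabytes once decompressed.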
The Dataset Collection
by Gwern Branwen
data
eye 31,813
favorite 7
comment 0
Dark Net Markets (DNM) are online markets typically hosted as Tor hidden services whose users transact in Bitcoin or other cryptocoins, usually for drugs or other illegal/regulated goods; the most famous DNM was Silk Road 1, which pioneered the business model. From 2013-2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage, lifetimes/characteristics, & legal riskiness; in addition, I made or obtained copies of as many...
Topics: Tor, Bitcoin, drugs, Silk Road, Evolution, Agora, black-markets, dark net markets
Dumps of DISCOGS.ORG Metadata (2008-Present)
collection
47 items
by DISCOGS.ORG
eye 25,794
This is an unofficial mirror of the DISCOGS.ORG data collection, which is located at http://www.discogs.com/data/ . Discogs, short for discographies, is a website and database of information about audio recordings, including commercial releases, promotional releases, and bootleg or off-label releases. The Discogs servers, currently hosted under the domain name discogs.com, are owned by Zink Media, Inc., and are located in Portland, Oregon, USA. Discogs is one of the largest online databases of...
The Dataset Collection
by Internet Archive
data
eye 17,736
favorite 7
comment 1
Culled from various sources, this collection includes over one million JPG, PNG and GIF album covers. The resolution ranges from "thumbnail" through to very large sizes. Filenames vary in usefulness, although a good number indicate at least the name of the original album. This dataset is for experimentation and image processing research only. At 148 GB, the collection is large but not unmanageable (there is a torrent available) and allows a developer or artist to work with the...
( 1 review )
Topics: dataset, big data, album covers, covers, cover art, cover photos
Internet Census 2012
collection
15 items
by Anonymous
eye 15,371
Abstract: While playing around with the Nmap Scripting Engine (NSE) we discovered an amazing number of open embedded devices on the Internet. Many of them are based on Linux and allow login to standard BusyBox with empty or default credentials. We used these devices to build a distributed port scanner to scan all IPv4 addresses. These scans include service probes for the most common ports, ICMP ping, reverse DNS and SYN scans. We analyzed some of the data to get an estimation of the IP address...
The Dataset Collection
by NYC Taxi and Limousine Commission
data
eye 11,019
favorite 2
comment 0
FOIA/FOILed Taxi Trip Data from the NYC Taxi and Limousine Commission, 2013. Released by http://chriswhong.com/open-data/foil_nyc_taxi/ . The files trip_data.7z and trip_fare.7z are more efficiently compressed versions of the data; you probably want these files. The data is in CSV format. For the data files this includes the fields: medallion, hack_license, vendor_id, rate_code, store_and_fwd_flag, pickup_datetime, dropoff_datetime, passenger_count, trip_time_in_secs, trip_distance, pickup_longitude,...
Topics: data, nyc, taxi, fare, csv, FOIA, FOIL
Source: torrent:urn:sha1:6c594866904494b06aae51ad97ec7f985059b135
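A minimal sketch of reading trip records with the field names listed above, using Python's standard `csv` module. The two sample rows are invented for illustration, and the exact header formatting of the real files (for example, stray whitespace around names) is an assumption to verify against the extracted data.

```python
import csv
import io

# Invented sample rows; only the column names come from the item description.
sample_csv = (
    "medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,"
    "pickup_datetime,dropoff_datetime,passenger_count,"
    "trip_time_in_secs,trip_distance\n"
    "M1,H1,VTS,1,N,2013-01-01 00:00:00,2013-01-01 00:10:00,1,600,2.5\n"
    "M2,H2,CMT,1,N,2013-01-01 00:05:00,2013-01-01 00:25:00,2,1200,5.5\n"
)


def summarize_trips(handle):
    """Return (trip count, total distance, total seconds) for a CSV stream."""
    total_distance = 0.0
    total_seconds = 0
    trips = 0
    for row in csv.DictReader(handle):
        total_distance += float(row["trip_distance"])
        total_seconds += int(row["trip_time_in_secs"])
        trips += 1
    return trips, total_distance, total_seconds


trips, total_distance, total_seconds = summarize_trips(io.StringIO(sample_csv))
print(trips, total_distance, total_seconds)
```

Because `summarize_trips` accepts any file-like object, the same function could be pointed at an opened data file once the .7z archive has been extracted.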
Academic Torrents
collection
741 items
by ACADEMICTORRENTS.COM
eye 7,483
Welcome to Academic Torrents! Making 14.15TB of research data available. We've designed a distributed system for sharing enormous datasets - for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.
The Dataset Collection
data
eye 4,958
favorite 1
comment 0
I took the Reddit comment archive and converted all the JSON into one SQLite database using a program that I wrote (https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc). I ran a few tests to make sure the number of database rows matches the number of JSON records. "SELECT MAX(rowid) FROM comment" and "SELECT COUNT(id) FROM comment" both return 1659361605. This gives me some confidence as to the integrity of the dataset, but I cannot be 100% sure. The compressed size is 163G....
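The two integrity queries quoted above can be tried on a toy database. This sketch uses an in-memory SQLite database with a hypothetical two-column `comment` table; the real schema is not given in the description.

```python
import sqlite3

# In-memory stand-in; the column layout is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comment (id TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO comment (id, body) VALUES (?, ?)",
    [("c1", "a"), ("c2", "b"), ("c3", "c")],
)

# Same two queries as in the item description.
max_rowid = conn.execute("SELECT MAX(rowid) FROM comment").fetchone()[0]
row_count = conn.execute("SELECT COUNT(id) FROM comment").fetchone()[0]

print(max_rowid, row_count)
print("consistent" if max_rowid == row_count else "mismatch")
conn.close()
```

The two numbers agree only when rows were inserted sequentially, none were deleted, and no `id` is NULL, since `COUNT(id)` skips NULL values. Agreement therefore suggests a complete load but does not prove that every record's content is intact.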
The Dataset Collection
by Weiwei Zhang, Jian Sun, and Xiaoou Tang
data
eye 3,918
favorite 2
comment 0
This dataset is mirrored from http://137.189.35.203/WebUI/CatDatabase/catData.html, which as of May 2017 is a dead link. The original page is available in Wayback: https://web.archive.org/web/20150520175645/http://137.189.35.203/WebUI/CatDatabase/catData.html . The CAT dataset includes 10,000 cat images. For each image, we annotate the head of the cat with nine points: two for eyes, one for mouth, and six for ears. The detailed configuration of the annotation is shown in Figure 6 of the original paper:...
Topics: cats, datasets, computer vision
The Dataset Collection
data
eye 3,292
favorite 0
comment 0
Large sets of malware examples for the purposes of research, comparison, and history. This is the Various set, which is a volume of specific smaller sets of malware.
The Dataset Collection
software
eye 2,679
favorite 1
comment 0
All the "journal article" DOIs from CrossRef's OAI-PMH server; URLs of just under 50 million journal articles.
Topics: doi, dataset
The Dataset Collection
by Ben
data
eye 2,341
favorite 0
comment 0
Ben's FTP List (May 2018): This is a trimmed-down list of all servers that are online and allow anonymous connections. There are 244,441 FTP servers in total. Please note: it is unknown whether these servers remained online after the scan or are behind dynamic IP addresses, so their availability after this list was compiled cannot be guaranteed. This census is provided as a series of bzip2 files, which can be read directly by utilities such as zmore and zless. It is intended both to be used for...
NIH Data Commons
collection
10 items
eye 1,934
The Data Commons Pilot Phase Consortium (DCPPC) is an NIH project to tackle the challenges of data-driven and data-intensive biomedical research: the data sets are too large to download; there is minimal interoperability between and across data set providers; and local compute capacity is often too limited to meet dynamic research needs. These challenges are preventing biomedical data from reaching its full potential in basic research, clinical, and translational medicine. DCPPC aims to improve this...
The Dataset Collection
data
eye 1,569
favorite 1
comment 0
A collection of fanfiction stories from fanfiction.net, repacked for easier bulk collecting and archiving. Contains many tens of thousands of fan fiction stories.
The Dataset Collection
data
eye 1,512
favorite 0
comment 0
Database of UPC product codes, as compiled by upcdatabase.com
Topics: UPC, Universal Product Code, barcode
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 1,235
favorite 0
comment 1
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20130103 and the data is from 2013-01-03.
( 1 review )
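The dump names in this collection embed the snapshot date, as in discogs_20130103 for 2013-01-03. A small sketch of recovering the date from such a name follows; the helper `dump_date` is hypothetical, and it assumes every name follows the `discogs_YYYYMMDD` pattern seen in these descriptions.

```python
from datetime import date, datetime


def dump_date(official_name):
    """Parse the snapshot date out of a name shaped like discogs_YYYYMMDD."""
    prefix = "discogs_"
    if not official_name.startswith(prefix):
        raise ValueError("unexpected dump name: " + official_name)
    return datetime.strptime(official_name[len(prefix):], "%Y%m%d").date()


print(dump_date("discogs_20130103").isoformat())
```

Parsing into a `date` object rather than comparing strings makes it straightforward to sort dumps or pick the snapshot nearest to a given day.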
Internet Archive Bibliographic Metadata
data
eye 1,232
favorite 1
comment 0
This file is a snapshot dump of the Crossref DOI metadata API, containing entries for over 94 million DOIs. Compared to the previous 2017-03 version (see archive.org item "crossref_doi_dump_201703"), this snapshot has a few million more works, but the corpus size is much larger (29 GB compressed vs. 7 GB compressed) as it now contains significantly more citation data, due to the efforts of the Initiative for Open Citations (I4OC) project. This was generated by running the scripts...
The Dataset Collection
data
eye 1,213
favorite 0
comment 0
CITY OF SCREENSHOTS is a dataset of automatically screenshotted computer programs taken from the uploaded items in the Internet Archive's software collection. An ever-growing dataset, it currently contains over 450,000 individual screenshot images from over 60,000 software programs for the Apple II, Atari 8-Bit, MS-DOS, Windows, ZX Spectrum and Tandy Color Computer. Screenshots were generated by automated script: An instance of Firefox would be spun up and aimed at the emulated item on the...
Topics: screenshots, computers, screens, artwork
Internet Census 2012
software
eye 1,156
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 1,088
favorite 0
comment 1
MusicBrainz Database Dump 20141004-015144
( 1 review )
Internet Archive Bibliographic Metadata
data
eye 1,084
favorite 0
comment 0
A snapshot of the oaDOI DOI/URL database, including open access status for each paper. oaDOI is the API backing unpaywall; see oadoi.org for more details. This dataset is intended for NON-COMMERCIAL USE ONLY; contact oaDOI for details or commercial support.
Internet Census 2012
software
eye 938
favorite 0
comment 0
The Dataset Collection
data
eye 929
favorite 0
comment 0
Database of UPC product codes, as compiled by upcdatabase.com
Topics: UPC, Universal Product Code, barcode
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 862
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20110601 and the data is from 2011-06-01.
The Dataset Collection
data
eye 842
favorite 0
comment 0
Minecraft Archive Project: MINECRAFTFORUM.NET
Internet Archive Bibliographic Metadata
data
eye 818
favorite 0
comment 0
Manifest of Internet Archive's identified scholarly works in digital form (e.g., journal articles). See README.html for details.
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 804
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20130301 and the data is from 2013-03-01.
Internet Census 2012
software
eye 798
favorite 0
comment 0
Internet Census 2012
software
eye 782
favorite 0
comment 0
Internet Archive Bibliographic Metadata
data
eye 780
favorite 0
comment 0
Manifest of Internet Archive's identified scholarly works in digital form (e.g., journal articles). See README.html for details.
Internet Census 2012
software
eye 777
favorite 0
comment 0
Internet Census 2012
software
eye 771
favorite 0
comment 0
The Dataset Collection
by NIST NSRL
data
eye 770
favorite 0
comment 0
The National Software Reference Library (NSRL) is designed to collect software from various sources and incorporate file profiles computed from this software into a Reference Data Set (RDS) of information. The RDS can be used by law enforcement, government, and industry organizations to review files on a computer by matching file profiles in the RDS. This will help alleviate much of the effort involved in determining which files are important as evidence on computers or file systems that have...
MusicBrainz Data Dumps
data
eye 768
favorite 0
comment 0
MusicBrainz Database Dump 20150523-002918
Internet Census 2012
software
eye 767
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 763
favorite 0
comment 0
MusicBrainz Data Dump 20120623-002010
Topics: database, music, musicbrainz
The Dataset Collection
data
eye 760
favorite 0
comment 0
openSNP is a community-driven and community-owned project which lives through the contributions of its members. It is not associated with any institution and the only funding source for openSNP is continuous crowdfunding, which pays for the hosting costs. openSNP allows customers of direct-to-customer genetic tests to publish their test results, find others with similar genetic variations, learn more about their results by getting the latest primary literature on their variations, and help...
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 756
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. The data is in XML format and formatted according to the API spec: http://www.discogs.com/developers/. This dump has been generated and archived automatically. Official name is discogs_20080309 and the data is from 2008-03-09.
Internet Census 2012
software
eye 754
favorite 0
comment 0
Internet Census 2012
software
eye 751
favorite 0
comment 0
Internet Census 2012
software
eye 749
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 745
favorite 0
comment 0
MusicBrainz Database Dump 20140215-002226
Internet Census 2012
software
eye 743
favorite 0
comment 0
The Dataset Collection
data
eye 740
favorite 0
comment 0
Internet Census 2012
software
eye 740
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 734
favorite 0
comment 0
MusicBrainz Database Dump 20140329-003047
Internet Archive Bibliographic Metadata
data
eye 733
favorite 0
comment 0
Downloaded from https://core.ac.uk/services . From that page: "The data aggregated from repositories by the CORE system can be accessed in two ways, through the CORE API or by downloading the data to your computer. The former option is practical if you want to build a service on top of CORE while the latter is something we recommend to those who would like to analyse the CORE dataset and/or apply some computationally intensive batch processes. If you use CORE in your work, we kindly request you to cite one...
Internet Archive Bibliographic Metadata
by ROAD: Directory of Open Access Scholarly Resources
data
eye 728
favorite 0
comment 0
This is a backup of ROAD/ISSN metadata from http://road.issn.org/en/contenu/download-road-records . Dumps in both MARC XML and RDF format are included; see sub-directory for date of download. See also the earlier July 2017 dump at: https://archive.org/download/road-issn-2017 . These files are under the Creative Commons Attribution-NonCommercial 4.0 International Public License (aka CC-BY-NC).
Topic: metadata
MusicBrainz Data Dumps
data
eye 726
favorite 0
comment 0
MusicBrainz Database Dump 20140924-014210
Internet Census 2012
software
eye 724
favorite 0
comment 0
Internet Census 2012
software
eye 712
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 711
favorite 0
comment 0
MusicBrainz Database Dump 20150408-002807
Internet Census 2012
software
eye 711
favorite 0
comment 0
MusicBrainz Data Dumps
data
eye 709
favorite 0
comment 0
MusicBrainz Database Dump: 20131012-002238
Internet Archive Bibliographic Metadata
by ORCID, Inc
data
eye 704
favorite 0
comment 0
This item contains an annual copy of the ORCID public data file, as originally downloaded from: https://orcid.org/content/download-file . More details about this content and its use are available at: https://orcid.org/content/orcid-public-data-file . This dataset is available under the public domain (CC-0). The DOI of this dataset is: https://doi.org/10.6084/m9.figshare.5479792
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 693
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20121101 and the data is from 2012-11-01.
MusicBrainz Data Dumps
data
eye 691
favorite 1
comment 0
MusicBrainz Database Dump 20150124-024159
MusicBrainz Data Dumps
data
eye 690
favorite 0
comment 0
MusicBrainz Database Dump 20140705-002713
Internet Archive Bibliographic Metadata
data
eye 689
favorite 0
comment 0
Manifest of Internet Archive's identified scholarly works in digital form (e.g., journal articles). See README.html for details.
Internet Archive Bibliographic Metadata
by ISSN
data
eye 686
favorite 0
comment 0
Unlike most ISSN metadata, this mapping file is publicly available.
Dumps of DISCOGS.ORG Metadata (2008-Present)
by DISCOGS.ORG
software
eye 684
favorite 0
comment 0
This is the monthly dump of DISCOGS.ORG data, provided to the public domain. This dump has been generated and archived automatically. Official name is discogs_20120601 and the data is from 2012-06-01.
MusicBrainz Data Dumps
data
eye 662
favorite 0
comment 0
MusicBrainz Database Dump 20140716-002400
The Dataset Collection
by William W. Cohen, MLD, CMU
web
eye 594
favorite 2
comment 0
This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of...
Topics: Enron, E-mail, Dataset
Internet Archive Bibliographic Metadata
by Directory of Open Access Journals
data
eye 592
favorite 0
comment 0
Downloaded from https://doaj.org/csv and the OAI-PMH interface. File names encode the date when data was downloaded.
The Dataset Collection
by Safecast
data
eye 566
favorite 0
comment 0
This item contains backups of measurement data from the Safecast environmental monitoring project: https://safecast.org . From their webpage: "Safecast is an international, volunteer-centered organization devoted to open citizen science for the environment. After the devastating earthquake and tsunami which struck eastern Japan on March 11, 2011, and the subsequent meltdown of the Fukushima Daiichi Nuclear Power Plant, accurate and trustworthy radiation information was publicly unavailable....
Topics: radiation, community science
Internet Archive Bibliographic Metadata
by Directory of Open Access Journals
data
eye 537
favorite 0
comment 0
Downloaded from https://doaj.org/csv and the OAI-PMH interface.
The Dataset Collection
data
eye 449
favorite 0
comment 0
Database of UPC product codes, as compiled by upcdatabase.com (http://sourceforge.net/projects/upcdatabase/)
Topics: UPC, Universal Product Code, barcode
Internet Archive Bibliographic Metadata
by Datacite
data
eye 428
favorite 0
comment 0
This item contains snapshots of the Datacite OAI-PMH metadata feed, as captured with the tool 'metha'.
NIH Data Commons
movies
eye 420
favorite 0
comment 0
Videos from Day 1 of the May NIH DCPPC Collaboration Workshop. The May 2018 DCPPC workshop took place on May 30-31 in the Countway Library of Medicine at the Harvard School of Public Health in Boston, MA. These videos are of the lightning talks at the start of the first day of the workshop. Each team provided a 5-ish minute summary of what they had been working on. Videos: Charles Reid of Team Copper; Cricket Sloan of Team Calcium; Paul Avillach of Team Calcium; Claris Castillo of Team...
Topics: dcppc, commonspilot
MusicBrainz Data Dumps
data
eye 410
favorite 0
comment 0
MusicBrainz Database Dump 20140611-002358
MusicBrainz Data Dumps
data
eye 410
favorite 0
comment 0
MusicBrainz Database Dump 20140806-002953
NIH Data Commons
movies
eye 396
favorite 0
comment 0
Videos from Day 2 of the May NIH DCPPC Collaboration Workshop. The May 2018 DCPPC workshop took place on May 30-31 in the Countway Library of Medicine at the Harvard School of Public Health in Boston, MA. These videos are of the lightning talks at the start of the second day of the workshop. Each team provided a 5-ish minute summary of what they planned to work on next. Videos: Rayna Harris from Team Copper; Sasha Wait Zaranek from Team Copper; Brad Heavner from TopMed Data Stewards Team...
Topics: dcppc, commonspilot
The Dataset Collection
by Stack Exchange, Inc.
data
eye 367
favorite 0
comment 0
This is intended to be an exact copy of the item "stackexchange" (https://archive.org/details/stackexchange) as of 2018-03-14. That item is continuously updated by Stack Exchange (which is great!); this snapshot could be helpful if something goes wrong with that process, or might be helpful for researchers if the upstream schema changes or to check for missing/changed data. See the "upstream" item for details and license/policy details.
MusicBrainz Data Dumps
data
eye 356
favorite 0
comment 0
MusicBrainz Database Dump: 20130612-002814